Abstract 

Escherichia coli is a highly diverse species, ranging from commensal of the intestinal tract, through intestinal pathogen, 
to extraintestinal pathogen. Here, we present data on the population diversity of E. coli, using Bayesian analysis to identify 13 distinct 
clusters within the population from multilocus sequence typing data, which map onto a whole-genome-derived phylogeny based on 
62 genome sequences. Bayesian analysis of recombination within the core genome identified a reduction in detectable core genome 
recombination as one moves from the commensals, through the intestinal pathogens, down to the multidrug-resistant extraintestinal 
pathogenic clone E. coli ST131. Our data show that the emergence of a multidrug-resistant, extraintestinal pathogenic lineage of 
E. coli is marked by a substantial reduction in detectable core genome recombination, resulting in a lineage which is phylogenetically 
distinct and sexually isolated in terms of core genome recombination. 
Introduction 

Escherichia coli is a bacterial species of enormous diversity, 
ranging from a harmless intestinal commensal organism of a 
myriad of animals, through environmental organism, to zoo- 
notic intestinal pathogen, and causative agent of extraintesti- 
nal infections such as urinary tract infection (UTI) and 
bacteremia (Croxen and Finlay 2010). Classical attempts at 
classifying E. coli have centered on simplistic methods such 
as pathotype where strains have been classified as commen- 
sals, EHEC, EPEC, ETEC, EIEC, EAEC, or ExPEC based on the 
disease pathology they are most associated with (Kaper et al. 
2004). Phylogenetically E. coli can be separated into phy- 
logroups based on a small number of discrete genetic markers 
(Clermont et al. 2000), which show a degree of correlation 
with isolation from host or niche. More advanced genotyping 
techniques such as multilocus sequence typing (MLST [Wirth 
et al. 2006]) have highlighted the shortcomings of pathotype 
distinction with sequence types (STs) of E. coli often spanning 
pathotypes (Olesen et al. 2012). The higher discriminatory 
power of MLST also identified more phylogroups within 
E. coli, the borders of which are clouded due to recombination 
(Wirth et al. 2006). 

Indeed, recombination has played a central role in the well- 
described diversity observed within E. coli. The majority of re- 
combination studies have focused on the vast level of hori- 
zontal gene transfer and genetic acquisition across E. coli, 
which is often intrinsic to the pathogenic lifestyle of the or- 
ganism (Dobrindt et al. 2004). Such are the levels of horizontal 
exchange of mobile genetic elements across E. coli that the 
accessory genome of the organism is essentially open (Rasko 
et al. 2008). Nor is the genetic exchange of accessory 
elements confined within subgroups of E. coli such as ST or 
phylogroup: toxigenic phage, pathogenicity islands, and 
antimicrobial resistance plasmids cross all subgroup 
boundaries, as exemplified by the mosaic genome of the 
E. coli O104 outbreak strain (Rasko et al. 2011). This 
recombination-derived mosaicism has presented a problem in 
untangling the population structure of E. coli and the 
evolutionary relationship between the various pathogenic 
variants. Furthermore, because most studies of recombination 
in E. coli have focused on the transfer of accessory elements 
between pathotypes, very little is known about how 
recombination in the core genome of E. coli varies across the 
population or how that variation is related to pathogenesis or 
niche. A better understanding of core genome recombination has 
recently been shown to provide evolutionary insights into the 
important human pathogens Enterococcus faecium, Streptococcus 
pneumoniae, and Neisseria spp. (Hanage et al. 2009; Corander 
et al. 2012; Willems et al. 2012), and it has been shown that 
differences in levels of recombination across a population are 
closely linked with ecological factors. Studies 
based on the diversity of E. coli using MLST suggest that re- 
combination has played a key role in the evolution of virulence 
and the emergence of strains with increased pathogenesis 
(Wirth et al. 2006), whereas studies based on a limited 
number of genome sequences have suggested that both ho- 
mologous and nonhomologous recombination have played a 
role in evolution of pathogenesis, though there is sexual iso- 
lation between phylogroups A/B1, B2, and E (Didelot et al. 
2012). 

In this study, we utilize algorithms designed for estimating 
recombination and population structure in large genome data 
sets, namely BratNextGen (Marttinen et al. 2012) and 
Bayesian population genetics software (BAPS) (Corander 
et al. 2008) to analyze the population structure of E. coli 
and determine how recombination correlates with pathogen- 
esis. We analyzed the entire E. coli MLST data set (mlst.ucc.ie) 
and genome data for 62 E. coli strains representing the se- 
quenced diversity of the organism. The genomes used range 
from commensal K12 laboratory strains, to intestinal patho- 
genic strains, through to strains associated with extraintestinal 
infections such as UTI, and culminating in a number of strains 
of E. coli ST131. ST131 has emerged over the last decade to 
become the globally dominant strain type associated with ex- 
traintestinal disease and dissemination of multidrug resistance, 
leading to it being termed the pandemic E. coli (Rogers et al. 
2011). By utilizing the most comprehensive set of data and 
analytical tools to date, we provide new insights into recom- 
bination and population structure in E. coli. Whole-genome 
phylogeny shows concordance with traditional phylogroups, 
with advanced Bayesian population analysis of the MLST data 
set for E. coli suggesting the presence of 13 separated 
population clusters, which exhibit admixture throughout. Detailed 
analysis of core genome recombination suggests an evolution- 
ary pattern from ubiquitous intestinal commensal exhibiting 
relatively frequent core genome recombination, through to 
highly specialized extraintestinal pathogen marked by a drastic 
decrease in detectable core genome recombination, and in 
the case of the newly emerged multidrug-resistant ST131, 
an almost stable core genome that is sexually isolated from 
the rest of the species including the most closely related 
phylogroup B2 ExPEC strains. These findings further our 
understanding of the processes involved in evolution of 
pathogenesis within the Enterobacteriaceae, illustrating how 
core genome recombination levels correlate with environmental 
niche and pathogenesis in E. coli, and provide new avenues 
of research in understanding the emergence of global 
pathogens. 

Materials and Methods 

Genome Data 

A total of 62 publicly available E. coli genome sequences 
were used in this analysis (table 1). Fifty are reference genome 
sequences available from NCBI, whereas 12 are ST131 
genomes produced during previous studies in our group (Clark 
et al. 2012). 

MLST Data 

To provide a higher level of resolution of the population 
structure, we performed BAPS analysis using the entire E. coli 
MLST database as of 1 September 2012 (supplementary table 
S1, Supplementary Material online). On that date, the database 
contained 2,880 STs for which public and nonaberrant allele 
sequences were available. 

Whole-Genome-Based Phylogeny 

Genome sequences were aligned using Mugsy (Sahl et al. 
2011), and the core genome was extracted using Mothur 
(Schloss et al. 2009), each with its default settings. 
The resulting alignment was used to determine a maximum 
likelihood (ML) phylogeny using RAxML (Stamatakis et al. 
2005), implementing the rapid bootstrap function and the 
general time reversible (GTR) model with Gamma correction, 
with 100 bootstraps performed. The best tree was imported 
into FigTree (http://tree.bio.ed.ac.uk/software/figtree/, last 
accessed March 31, 2013) for graphical annotation. 
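
As an illustration of what extracting a core genome from a whole-genome alignment involves, the following minimal sketch keeps only those alignment columns in which every genome carries an unambiguous base. It is not the Mugsy/Mothur procedure used here; the function names, the column-filtering rule, and the toy sequences are assumptions made purely for illustration.

```python
def core_columns(alignment):
    """Return indices of columns where every sequence has A, C, G, or T.

    `alignment` maps a strain name to its aligned sequence (hypothetical input).
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    valid = set("ACGT")
    return [i for i in range(length)
            if all(s[i].upper() in valid for s in seqs)]


def extract_core(alignment):
    """Build the core-genome alignment from the shared, unambiguous columns."""
    cols = core_columns(alignment)
    return {name: "".join(seq[i] for i in cols)
            for name, seq in alignment.items()}


# Toy alignment (made-up sequences): column 3 has a gap, column 6 has an N.
toy = {"K12": "ATG-CGTA", "ST131": "ATGACGNA", "O157": "ATGTCGTA"}
core = extract_core(toy)
print(core)
```

Columns containing a gap or an ambiguous base in any genome are dropped, so the resulting core alignment has the same length for all strains.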

Bayesian Population Genetics Software 

BAPS (Tang et al. 2009) was applied to cluster the MLST data 
into genetically distinct groups and to estimate the level of 
admixture for each ST. The analyses were performed with 
the second-order Markov model and the standard MLST 
data input option, as described in Hanage et al. 
(2009). The optimal clustering was obtained using 10 runs 
of the estimation algorithm with the prior upper bound of 
the number of clusters varying in the range (Corander et al. 
2012) over the 10 replicates. Each estimation run yielded a 
partition of the ST data that was highly congruent with those 
of the other runs, with exactly 13 clusters in each, 
indicating a highly peaked posterior distribution in the 
neighborhood of these partitions (estimated posterior 
probability 1.000). The admixture analysis was subsequently 
performed using the 13 clusters in the estimated posterior 
mode partition with 100 Monte Carlo replicates for allele 
frequencies and by generating 100 reference genotypes to 
calculate P values. For reference cases, we used 10 iterations 
in estimation according to the guidelines of Corander and 
Marttinen (2006). STs were concluded to be significantly 
admixed when the P value did not exceed the threshold of 5%. 

Table 1 

List of Escherichia coli Genome Data Used in This Study 

Strain                 ST       BAPS     Pathotype            Accession 
                                Cluster                       Number 
E. coli CE10           62       8        K1 ExPEC             NC_017646 
E. coli DH1-ECDH1      1,060    5        K12                  CP001637.1 
E. coli ME8569         1,060    5        K12                  AP012030.1 
E. coli DH10B          1,060    5        K12                  NC_010473.1 
E. coli W3110          10       5        K12                  AC000091 
E. coli MG1655         10       5        K12                  U00096.2 
E. coli BW2952         10       5        K12                  CP001396.1 
E. coli P12b           10       5        K12                  NC_017663.1 
E. coli H10407         48       5        ETEC                 FN649414 
E. coli UMNK88         100      5        K88                  CP002729.1 
E. coli REL606         93       5        B                    CP000819.1 
E. coli BL21 DE3       93       5        B                    CP001509.3 
E. coli ATCC9637       1,079    3        W                    CP002185.1 
E. coli SE11           156      3        Human commensal      AP009240.1 
E. coli E24377A        1,132    3        ETEC                 CP000800.1 
E. coli IAI1           1,128    3        O:8                  CU928160.2 
E. coli 55989          678      1        EAEC                 CU928145.2 
E. coli C227-11        678      1        O104                 AFRH00000000 
E. coli 12009          17       1        O103 EHEC            AP010958.1 
E. coli 11128          16       3        O111 EHEC            AP010960.1 
E. coli 11368          21       3        O26 EHEC             AP010953.1 
E. coli ATCC8739       1,120    5        K12                  CP000946.1 
E. coli HS             46       7        Human commensal      CP000802.1 
E. coli CB9615         335      9        O55 EHEC             CP001846.1 
E. coli EDL933         11       9        O157 EHEC            AE005174.2 
E. coli Sakai          11       9        O157 EHEC            BA000007.2 
E. coli TW14359        11       9        O157 EHEC            CP001368.1 
E. coli EC4115         11       9        O157 EHEC            NC_011353.1 
E. coli Xuzhou21       11       9        O157 EHEC            NC_017906.1 
E. coli RM12579        335      9        O55 EHEC             NC_017656.1 
E. coli 042            414      6        EAEC                 FN554766.1 
E. coli UMN026         597      6        O:7 ExPEC            CU928163.2 
E. coli SMS-3-5        354      8        Multidrug resistant  CP000970.1 
E. coli E2348/69       15       4        O127 EPEC            FM180568.1 
E. coli UTI18          131^a    4        ExPEC                ERP001095 
E. coli EC958          131      4        ExPEC                CAFL01000001 
E. coli NA114          131^c    4        ExPEC                CP002797.1 
E. coli P2U            131      4        ExPEC                (illegible) 
E. coli P5U            131      4        ExPEC                (illegible) 
E. coli P2B            131      4        ExPEC                ERX159099 
E. coli UTI24          131^a    4        ExPEC                ERP001095 
E. coli UTI32          131^a    4        ExPEC                ERP001095 
E. coli UTI62          131^a    4        ExPEC                ERP001095 
E. coli UTI188         131^a    4        ExPEC                ERP001095 
E. coli UTI226         131^a    4        ExPEC                ERP001095 
E. coli UTI306         131^a    4        ExPEC                ERP001095 
E. coli UTI423         131^a    4        ExPEC                ERP001095 
E. coli UTI587         131^a    4        ExPEC                ERP001095 
E. coli SE15           131^a    4        Human commensal      AP009378.1 
E. coli LF82           135      4        AIEC                 NC_011993.1 
E. coli IHE3034        95       4        ST95 ExPEC           CP001969.1 
E. coli UTI89          95       4        ST95 ExPEC           CP000243.1 
E. coli S88            95       4        O45 ExPEC            CU928161.2 
E. coli APEC O1        95       4        APEC                 CP000468.1 
E. coli UM146          643      4        AIEC                 CP002167.1 
E. coli 536            127      4        O6 ExPEC             CP000247.1 
E. coli LF82           135      4        AIEC                 CU651637.1 
E. coli NRG857c        135      4        AIEC                 CP001855.1 
E. coli ED1a           452      4        O81                  CU928162.2 
ABU83972               73       4        Asymptomatic         CP001671 
E. coli CFT073         73       4        ExPEC                AE014075.1 
E. coli Di14           73       4        ExPEC                CP002212.1 
E. coli Di12           73       4        ExPEC                CP002211.1 

^a 2009 UK ST131 isolates. 
^b 2004 UK ST131 isolate. 
^c Indian ST131 isolate. 
^d 2012 UK ST131 isolates. 

MLST-Based Phylogeny 

The phylogenetic distribution of the BAPS clusters was deter- 
mined using an ML tree estimated with FastTree (Price et al. 
2009) using the default settings (1,000 bootstrap replicates 
with the general time-reversible model and Gamma model for 
rate heterogeneity) based on the concatenated MLST data 
over the seven loci for all identified MLST STs. 
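
The concatenation step described above can be sketched as follows. The locus names are those of the seven-gene E. coli MLST scheme (adk, fumC, gyrB, icd, mdh, purA, recA), which is assumed here to be the scheme behind the database used; the allele lookup structure, the function name, and the toy sequences are hypothetical and serve only to show how one concatenated sequence per ST is built before tree estimation.

```python
# Assumed locus order for the seven-gene E. coli MLST scheme.
LOCI = ("adk", "fumC", "gyrB", "icd", "mdh", "purA", "recA")


def concatenate_st(profile, alleles):
    """Concatenate the allele sequences of one ST in a fixed locus order.

    profile: locus -> allele number for this ST (hypothetical structure)
    alleles: locus -> {allele number: nucleotide sequence}
    """
    missing = [locus for locus in LOCI if locus not in profile]
    if missing:
        raise KeyError(f"profile lacks loci: {missing}")
    return "".join(alleles[locus][profile[locus]] for locus in LOCI)


# Toy allele table with made-up 4-bp sequences, two alleles per locus.
toy_alleles = {locus: {1: "ACGT", 2: "ACGA"} for locus in LOCI}
toy_profile = {"adk": 1, "fumC": 2, "gyrB": 1, "icd": 1,
               "mdh": 2, "purA": 1, "recA": 1}

concatenated = concatenate_st(toy_profile, toy_alleles)
print(concatenated)
```

Keeping the locus order fixed ensures that every ST contributes homologous positions to the same alignment columns.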

BratNextGen 

The BratNextGen software package (Marttinen et al. 2012) was 
used to determine recombining regions in the whole-genome 
data comprising the 62 sequences. The estimation was per- 
formed with the default settings as in Marttinen et al. (2012) 
using 10 iterations of the estimation algorithm, which was 
assessed to be sufficient because changes in the hidden 
Markov model parameters were already negligible over the 
last couple of iterations. Significance of a recombining region 
was determined as in Marttinen et al. (2012) using a permu- 
tation test with 100 permutations executed in parallel on a 
cluster computer (threshold of 5% was used to conclude sig- 
nificance for each region). 
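
Once significant recombinant segments have been identified for a strain, the proportion of its core genome affected by recombination can be summarized as a percentage. The sketch below is one straightforward way to compute such a figure, assuming segments are given as inclusive (start, end) coordinates on the core genome; it is not BratNextGen output parsing, and the function names and example coordinates are invented for illustration.

```python
def merge_segments(segments):
    """Merge overlapping or adjacent inclusive (start, end) intervals."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(seg) for seg in merged]


def percent_recombinant(segments, core_length):
    """Percentage of core-genome positions covered by recombinant segments."""
    covered = sum(end - start + 1 for start, end in merge_segments(segments))
    return 100.0 * covered / core_length


# Hypothetical segments on a hypothetical 10,000-bp core genome.
example_segments = [(1, 100), (50, 150), (1000, 1099)]
pct = percent_recombinant(example_segments, core_length=10_000)
print(f"{pct:.2f}% of the core genome is in recombinant segments")
```

Merging before summing avoids counting positions twice when segments overlap.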

Statistical Testing of Significance in Differences of 
Recombination Levels 

To investigate whether the observed differences in the 
amount of recombination could be accounted for by random 
variation within and between lineages, we performed standard 
permutation tests. For a given labeling of strains into two 
groups, we calculated first the absolute value of the difference 
in the mean estimated amount of recombination between the 
two groups. This resulted in the observed statistic T_obs. Then, 
under the null hypothesis of no systematic difference in re- 
combination between the groups, the group labels were ran- 
domly permuted among strains and the corresponding value 
of the statistic T_perm was calculated. To obtain a P value for 
T_obs under the null hypothesis, the permutation procedure 
was repeated 1,000,000 times, which yields an estimate of 
the probability P(T_perm > T_obs). 
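
The permutation procedure just described can be sketched in a few lines. This is a minimal illustration rather than the code used for the analysis: the recombination percentages are made-up values, the number of permutations is reduced from 1,000,000 for speed, and ties are counted (T_perm >= T_obs, with a small floating-point tolerance), which is an assumption of this sketch.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """Two-group permutation test on the absolute difference in means.

    Returns (T_obs, estimated P value).
    """
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    t_obs = abs(mean(group_a) - mean(group_b))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign group labels
        t_perm = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if t_perm >= t_obs - 1e-12:  # tolerance guards against rounding
            hits += 1
    return t_obs, hits / n_perm


# Hypothetical per-strain recombination percentages for two groups.
low_group = [0.3, 0.4, 0.5, 0.4]
high_group = [2.0, 2.2, 2.1, 2.3]
t_obs, p_value = permutation_test(low_group, high_group)
print(f"T_obs = {t_obs:.2f}, P = {p_value:.4f}")
```

With small groups the attainable P value is bounded below by the number of distinct label assignments, so very small P values require both enough strains and enough permutations.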

Phylogeny of Recombining Regions and Core Genomes 
with Recombining Regions Removed 

FastTree was used with the same settings as for the MLST to 
determine the phylogeny of significantly recombining regions 
and core genomes with significantly recombining regions re- 
moved. For each group of strains defined by the MLST BAPS 
clustering, the fractions of recombinations shared within and 
between groups were determined from the BratNextGen 
output. In addition, the r/m ratio was estimated for each 
group from the number of single-nucleotide polymorphisms 
(SNPs) residing within and outside of the recombinant 
segments. 
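
A simple way to picture the r/m estimate is as the ratio of SNPs falling inside recombinant segments to SNPs falling outside them. The sketch below implements that reading under stated assumptions: SNPs are given as core-genome positions, segments as inclusive (start, end) pairs, and the function name and example values are hypothetical. The exact estimator used in the analysis may differ in detail.

```python
def r_over_m(snp_positions, segments):
    """Ratio of SNPs inside recombinant segments to SNPs outside them."""
    inside = sum(
        1 for pos in snp_positions
        if any(start <= pos <= end for start, end in segments)
    )
    outside = len(snp_positions) - inside
    if outside == 0:
        raise ValueError("no SNPs outside recombinant segments; r/m undefined")
    return inside / outside


# Hypothetical SNP positions and a single hypothetical recombinant segment.
snps = [10, 50, 120, 130, 400, 900]
recombinant = [(100, 200)]
ratio = r_over_m(snps, recombinant)
print(f"r/m = {ratio:.2f}")
```

Here two of the six SNPs lie within the segment and four lie outside it, giving a ratio of 0.5.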

Results 

Whole-Genome Phylogeny of E. coli and Determination 
of Population Structure from MLST Data 

To determine the phylogeny of the 62 E. coli strains in our 
analysis (table 1), a whole-genome alignment was performed 
using Mugsy and the core genome extracted to infer phylogeny 
using RAxML. The core genome is composed of 2,336,639 bp, 
and the resulting phylogeny shows concordance with previous 
E. coli whole-genome phylogenies (Rasko et al. 2011), with 
hierarchical clustering based around phylogroup (fig. 1). The 
E. coli ST131 strains also show very tight clustering in 
concordance with previous findings (Clark et al. 2012), though 
additionally this phylogeny places strains from the United 
Kingdom and India isolated between 2004 and 2010 (Avasthi et 
al. 2011; Totsika et al. 2011; Clark et al. 2012) in a 
monophyletic clade exhibiting very low diversity, furthering 
the suggestion that ST131 may be a globally disseminated clone. 
However, there is also the formation of a second cluster of 
E. coli ST131 strains comprising the reference genome strain 
SE15 and two additional strains isolated in the United Kingdom 
in 2012, none of which exhibit antimicrobial resistance. 

In an attempt to provide a higher level of resolution to the 
population, we performed BAPS analysis using the data avail- 
able on the entire E. coli MLST database (mlst.ucc.ie) as of 1 
September 2012 (supplementary table S1, Supplementary 
Material online), which resulted in the identification of 13 
BAPS clusters. The database contained 2,880 STs for which 
public and nonaberrant allele sequences were available. The 
phylogenetic distribution of the clusters was determined using 
an approximate ML tree estimated with FastTree (fig. 2). This 
shows that recombination has blurred the boundaries of line- 
ages to a considerable degree but not uniformly over all the 
lineages. Notably, apart from a small subset of STs within 
BAPS cluster 4, this cluster forms a monophyletic clade. 
Mapping of BAPS clusters onto the whole-genome phylogeny 
(fig. 1 and supplementary fig. S1, Supplementary Material 
online) identified BAPS cluster 4 isolates as all belonging to 
phylogroup B2 and all being ExPEC strains with the exception 
of E2348/69, which is the reference O127 EPEC strain. All K12 
strains are contained within BAPS cluster 5, except HS, which 
belongs to BAPS cluster 7, and all O55 and O157 EHEC strains 
are within BAPS cluster 9. The phylogroup B1 clade contains 
two discrete BAPS clusters within it. The majority are within 
BAPS cluster 3 except the two EAEC strains 55989 and 
C227-11 (E. coli O104) and strain 12009 (O103 EHEC), 
which are in BAPS cluster 1. This population grouping 
confirms that pathotypes are not a robust way to differentiate 
E. coli and that phylogroups can also be distributed across the 
population. Our data provide a population framework to 
MLST supporting 13 distinct populations and in particular a 
clearly distinct BAPS cluster 4 containing only phylogroup B2 
extraintestinal pathogenic STs. Determination of the levels of 
admixture across the BAPS clusters based on MLST data 
(fig. 3) supports the idea of discrete clusters but with signifi- 
cant recombination. A summary of the admixture results is 
given in supplementary table S2, Supplementary Material 
online. Notably, BAPS MLST cluster 4 is the sole cluster harbor- 
ing STs from a single ancestral group only, B2 (supplementary 
table S4, Supplementary Material online). For a majority of the 
BAPS clusters, several ancestral groups are found within a 
single cluster. Among all clusters with at least 50 STs assigned 
to allow for more robust estimation of subpopulation charac- 
teristics, the frequency of admixed STs is smallest in cluster 4, 
and furthermore, the mean fraction of DNA atypical for the 
cluster is also smallest. 

Quantification of Recombination with BratNextGen 
Shows Varied Recombination Frequencies across BAPS 
Clusters 

To further examine the level of recombination across the BAPS 
clusters, we interrogated our data set comprising 62 aligned 
whole genomes using BratNextGen (fig. 4). The results clearly 
show an uneven level of recombination across the BAPS 
clusters, with the E. coli ST131 clade within BAPS cluster 4 
displaying very little recombination, with an average of just 
0.39% of the core genome undergoing recombination. 
Recombination then increases moving into the 
remaining BAPS cluster 4 ExPEC isolates and BAPS cluster 9 
EHEC strains, and onward into the BAPS cluster 1, 3, 6, 7, and 
8 strains associated with intestinal disease, culminating in the 
BAPS cluster 5 K12 strains exhibiting the highest level of 
recombination at an average of 2.19% of the core genome. 
Quantitative analysis of differences between the recombination 
levels of the BAPS clusters was performed using permutation 
tests (fig. 5), which showed that the resistant ST131 
strains within BAPS cluster 4 had significantly less 
recombination than the other STs with multiple isolates 
present (ST73, ST95, and ST135; P = 0.000001) and that the K12 
strains of BAPS cluster 5 (ST10, ST1060) exhibited 
significantly higher levels of core genome recombination than 
strains in ST73, ST95, and ST135 (P = 0.00001). This is 
indicative of a sliding 
scale of core genome recombination, starting with high levels 
in the commensal K12 strains, reducing into the intestinal 
pathogenic strains, further still into the highly virulent intesti- 
nal pathogenic EHEC and the extraintestinal pathogenic 
strains, culminating in the pandemic, multidrug-resistant 
ST131 extraintestinal pathogenic strains. To ensure that 
these differences in core genome recombination cannot be 
explained away by reduced level of genome diversity within 
any particular MLST lineage included in the comparisons, we 
calculated from the nonrecombinant genome segments 

Fig. 1. Whole-genome-based phylogeny of Escherichia coli. The phylogeny is based on approximately 2.3-Mbp core genome aligned using Mugsy, 
with an ML phylogeny determined using RAxML. Classical phylogroup is indicated at each clade of the tree with the appropriate capital letter. Strains are 
further color coded according to allocated BAPS cluster (yellow = BAPS 4, ST131 isolates with black border; brown = BAPS 7; purple = BAPS 5; light 
blue = BAPS 3; red = BAPS 1; dark blue = BAPS 9; cyan = BAPS 6; green = BAPS 8). The calculated r/m value for each major clade is also presented. 



relative SNP distances between all pairs of isolates within each 
lineage (ST73, ST95, ST135, and resistant ST131). A two-sample 
Kolmogorov-Smirnov (K-S) test was used to examine whether 
the distribution of SNP distances was markedly different 
between any pair of these lineages. Notably, the mean relative 
SNP distance between two isolates was highest within the 
resistant ST131 among these four MLST lineages, thus indicating 
that the identified reduction in core genome recombination 
cannot be plausibly explained by the lineage being "younger" 
and less diverse than the other lineages. The K-S test yielded a 
nonsignificant result when the distributions of relative SNP 
distances were compared between resistant ST131 and 
Fig. 2. ML phylogeny of the Escherichia coli population based on concatenated MLST data. Each taxon in the tree is a different ST present in the E. coli 
MLST database and is color coded by its allocated BAPS cluster. 




Fig. 3. Graphical representation of genetic admixture between the Escherichia coli BAPS population clusters as determined by BAPS analysis of MLST 
data. The colors on the fringes of each cluster denote introgression of DNA from that source cluster into the recipient cluster. 

Fig. 4. Graphical representation of recombination across 62 
Escherichia coli genome sequences as determined by BratNextGen analysis. 
Colored bars on the left indicate the BAPS cluster to which each strain 
in the analysis belongs (yellow = BAPS 4; brown = BAPS 7; 
purple = BAPS 5; light blue = BAPS 3; red = BAPS 1; dark blue = BAPS 
9; cyan = BAPS 6; green = BAPS 8). Each strain in the analysis is a dash 
on the y axis of the diagram. The x axis is marked by base pair position 
relative to the core genome pseudomolecule formed from the 
whole-genome alignment. Bars in the diagram represent regions of 
recombination detected within the core genome of each strain, with the 
color coding of the bars allocated in an arbitrary manner. The figure to 
the left of the BAPS color indicator denotes the average percentage of 
recombination in the core genome for that lineage. 



ST73 or ST95 (P = 1.000), whereas a significant difference was 
observed between ST135 and the other lineages (P = 0.0001). 
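
The lineage-diversity check above rests on two ingredients: pairwise relative SNP distances within a lineage and a two-sample K-S comparison of the resulting distributions. The sketch below shows both in plain Python. It assumes that "relative SNP distance" means the proportion of differing sites between two aligned sequences, which is an interpretation rather than a stated definition, and it computes only the K-S statistic D (the maximum gap between the two empirical distribution functions), not a P value. Function names and sequences are hypothetical.

```python
from bisect import bisect_right
from itertools import combinations


def relative_snp_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def pairwise_distances(sequences):
    """Relative SNP distances for all pairs of isolates within one lineage."""
    return [relative_snp_distance(a, b) for a, b in combinations(sequences, 2)]


def ks_statistic(sample_x, sample_y):
    """Two-sample K-S statistic: largest difference between empirical CDFs."""
    xs, ys = sorted(sample_x), sorted(sample_y)
    d = 0.0
    for value in sorted(set(xs) | set(ys)):
        cdf_x = bisect_right(xs, value) / len(xs)
        cdf_y = bisect_right(ys, value) / len(ys)
        d = max(d, abs(cdf_x - cdf_y))
    return d


# Made-up aligned sequences for two hypothetical lineages.
lineage_1 = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
lineage_2 = ["ACGTACGT", "TCGAACGA", "AGGTTCGT"]
d_stat = ks_statistic(pairwise_distances(lineage_1),
                      pairwise_distances(lineage_2))
print(f"K-S statistic D = {d_stat:.3f}")
```

A value of D near 0 indicates similar distance distributions in the two lineages, whereas a value near 1 indicates almost no overlap.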

Our BratNextGen analysis also appears to suggest that the 
recombining regions in each BAPS cluster are largely specific 
to that particular cluster and that there is a degree of sexual 
isolation between the clusters. This has been suggested before 
with respect to phylogroups A+B1, B2, and E (Didelot et al. 
2012), but our analysis indicates this could be occurring at a 
level beyond phylogroups. To determine whether core 
genome recombination is sexually limited across BAPS clusters, 
we extracted the recombining regions from our core genome 
data set and inferred phylogeny from them using FastTree 
(fig. 6). The resulting phylogeny mirrors that of the entire 
core genome supporting the suggestion that there is no sig- 
nificant recombination between BAPS clusters at the core 
genome level. To confirm the phylogeny, we calculated the 
proportion of recombinant segments in each BAPS cluster 
(including a separate calculation for ST131), which were in- 
tracluster specific, and the proportion of recombinant seg- 
ments, which were shared across clusters, or intercluster 
recombination events. The resulting data clearly support the 
view that recombination preferentially occurs between 
strains from the same cluster within BAPS clusters 4 and 9, 
with ST131 showing a large bias toward intra-ST131 
recombination. Together, our data provide further insight into 
the sexual isolation of the ST131 group, which in the phylogeny 
of recombining regions is very tightly clustered and lies at a 
greater distance from all other strains than it does at the 
whole-genome level. 

We examined the areas of the core genome in which re- 
combination was detected in each BAPS cluster by mapping 
the regions onto an annotated pseudomolecule of the core 
E. coli genome, highlighting the presence of the majority of 
recombination events in CDS, though with some intergenic 
regions also recombining. There was no physical clustering of 
recombination in distinct regions of the chromosome that 
might have suggested hotspots or multiple insertions via a 
single recombination event in any of the 62 strains. 
Similarly, recombination was not associated with 
any functional category of gene in any BAPS cluster, nor was 
there any association with a particular gene in any cluster 
that might imply some obvious biological relevance regarding 
niche or pathogenesis. This is in contrast to a recent article 
suggesting recombination hotspots in the rfb operon, fimA, 
and the aroC locus (Didelot et al. 2012); however, we make no 
comment on the validity of these findings. Indeed, further 
analysis of all the recombinant regions across all 62 genomes 
is currently the focus of a significant body of work, in the hope 
it may add to our hypothesis on the role of ecology in defining 
the recombination patterns described here. 

Discussion 

Escherichia coli is a highly diverse organism, which ranges 
from the intestinal commensal K12 strains, through to intes- 
tinal pathogenic variants such as ETEC, EAEC, and EPEC, to 
severe intestinal pathogens such as E. coli O157, and then 
extraintestinal pathogenic variants causing UTIs and bac- 
teremia. This enormous diversity has made E. coli the subject 
of countless comparative and evolutionary studies attempting 
to determine the mechanisms by which each of the subgroups 
has diversified and specialized (Dykhuizen and Green 1991; 
Touchon et al. 2009). The recent emergence of the 
multidrug-resistant E. coli O25b:H4 ST131 as the globally dominant 
strain type isolated from extraintestinal infections (Peirano 
and Pitout 2010) provides a new and important dimension 
to the study of how E. coli evolves and diversifies. ST131 
strains first came to prominence in the early 21st century as 
the strain type responsible for drug-resistant outbreaks of 
community acquired bacteremia (Lau et al. 2008; Nicolas- 
Chanoine et al. 2008; Ender et al. 2009; Pitout et al. 2009; 
Rooney et al. 2009), since which time it has become the dominant 
strain type associated with UTIs (Johnson et al. 2010; Croxall 
et al. 2011; Platell et al. 2011; Zong and Yu 2011). Genome 
sequence data are suggestive of global dissemination of a 
highly successful clone of ST131 (Clark et al. 2012), unlike 
other recently emerged globally successful E. coli such as 
Fig. 5. Graph showing the percentage of core genome undergoing recombination in each of the 62 genomes in our analysis. Bars are color coded 
according to the BAPS cluster that strain is allocated to (yellow = BAPS 4, light yellow ST131 strains within BAPS 4; brown = BAPS 7; purple = BAPS 5; light 
blue = BAPS 3; red = BAPS 1; dark blue = BAPS 9; cyan = BAPS 6; green = BAPS 8). Bars with values above the ST131 and K12 strains indicate P values for 
significant difference between that group of strains and all others as determined by standard permutation tests. 



E. coli O157, which is a diverse population of organisms (Bono 
et al. 2012). Also unlike O157, the dominant emergence of 
ST131 does not seem to be linked to any increased virulence 
phenotype (Johnson et al. 2012) or any set of specific or 
unique genetic loci (Clark et al. 2012) other than 
dissemination of CTX-M-15. 

Our reconstruction of the whole-genome-informed phylogeny 
of E. coli is in good agreement with previously published 
phylogenies (Rasko et al. 2008, 2011; Lukjancenko et al. 
2010; Didelot et al. 2012) but is the first to contain all the 
ST131 genomes sequenced to date (Avasthi et al. 2011; 
Totsika et al. 2011; Clark et al. 2012), as well as three new 
genome sequences from strains isolated in the United Kingdom 
in 2012 (P5U, P2U, and P2B) and the strain SE15, which is 
published as an O150 strain isolated as a human commensal 
(Toh et al. 2010) but which types as ST131 under the MLST 
scheme. The phylogenetic tree shows that ST131 strains cluster 
within the phylogroup B2 ExPEC strains as expected and form 
two distinct groups, the first containing SE15 and two of the 
new UK 2012 isolates. The second contains all the O25b and 
multidrug-resistant isolates, spanning an 8-year period and 
from the United Kingdom and India, and shows very little 
diversity, in concordance with earlier data from our group 
(Clark et al. 2012). 



Phylogeny alone provides very little detail as to how distinct
lineages of E. coli arose or indeed what distinct lineages exist
beyond the classical phylogrouping. Our BAPS analysis based
on all publicly available MLST data suggests the presence of 13
distinct population clusters within the E. coli population. These
clusters are in remarkably good agreement with the lineages
detected in the core genome phylogeny (fig. 1), and when the
concatenated MLST phylogeny and resulting BAPS groupings
for the whole-genome sequence strains are compared with the
whole-genome phylogeny, there is direct concordance
(supplementary fig. S1, Supplementary Material online). The major
exception to this agreement is found within phylogroup B1,
where the genome-wide information intermixes strains
from BAPS clusters 1 and 3, possibly due to the fact
that both these clusters comprise human intestinal pathogens
that also colonize the intestinal tracts of livestock and
therefore share an ecological niche. ST131 is found in BAPS
cluster 4, which contains all the phylogroup B2 ExPEC strains
in our phylogeny and also displays the lowest levels of admixture
across all the E. coli populations. All other classical
phylogroups are disseminated across multiple BAPS clusters. This
would suggest BAPS cluster 4 is a population of more "clonal"
strains of E. coli that are linked by their association with
Fig. 6. ML phylogeny of the detected recombining regions in each genome. The recombinant regions as determined by BRATNextGen were extracted
and concatenated for each genome before being aligned and interrogated using RAxML. Strains are color coded according to their allocated BAPS cluster
(yellow = BAPS 4; brown = BAPS 7; purple = BAPS 5; light blue = BAPS 3; red = BAPS 1; dark blue = BAPS 9; cyan = BAPS 6; green = BAPS 8). Pie charts
next to clades indicate the proportion of recombining regions that are intra-BAPS cluster specific (red in white) and inter-BAPS cluster recombination
events (blue in white).



extraintestinal infection and that is marked by a reduction in 
admixture events from outwith the subpopulation. To confirm 
this further, we conducted a fuller investigation of recombi- 
nation at the whole-genome level using BRATNextGen. The 
whole-genome recombination analysis clearly shows the 
ST131 strains within BAPS cluster 4 having a marked decrease 
in the level of detectable core genome recombination when 
compared with the other E. coli in our analysis, though the 
levels in the non-ST131 BAPS cluster 4 strains are similar to 
those observed in the intestinal pathogens with the exception 
of BAPS cluster 9 containing E. coli O157 and its direct ancestral
relative E. coli O55 (Leopold et al. 2009), which display the
next lowest levels of detectable core recombination after 
ST131. The highest levels of core genome recombination 



are found in K12 strains located in BAPS cluster 5, which is
perhaps unsurprising given many of these are derivatized
laboratory strains and intestinal commensals. The finding of
reduction in recombination associated with increased
pathogenesis in E. coli is in direct contrast to the findings of Wirth
et al. (2006), who used MLST data and population genetic
analysis to infer that increased virulence in
E. coli was a result of increased recombination. The
discrepancy between our study and theirs may be due to the level of
resolution afforded by the analysis of dozens of complete
and draft genome sequences, as well as the Bayesian analysis
programs utilized in our study, which provide a detailed analysis of
recombination across entire core genomes rather than a very
small subset of selected genes as in MLST. This argument is
strengthened by the fact that our data presented here are in
agreement with those of Didelot et al. (2012) in that
recombination patterns indicate a trend toward sexual isolation in
phylogroup B2. 
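
The group comparisons summarized in fig. 5 rely on standard permutation tests. As a minimal sketch of how such a test can be set up, the example below compares the mean percentage of recombining core genome in one group of strains against all others. The function name, the choice of the difference in means as test statistic, and every numeric value are illustrative assumptions; they are not the study's data or its exact procedure.

```python
import random
from statistics import mean


def permutation_test(group, others, n_permutations=10000, seed=0):
    """Two-sided permutation test on the absolute difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group) - mean(others))
    pooled = list(group) + list(others)
    n_group = len(group)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_group]) - mean(pooled[n_group:]))
        if diff >= observed:
            extreme += 1
    # The add-one correction avoids reporting an impossible p-value of zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical percentages of core genome under recombination (not real data).
st131 = [0.5, 0.8, 0.6, 0.7, 0.9]
others = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 7.2, 3.9, 4.4, 6.3]
p_value = permutation_test(st131, others)
```

A small `p_value` here would indicate that a difference as large as the observed one rarely arises when group labels are assigned at random.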

Our finding of reduction in recombination of the core 
genome in the globally disseminated, multidrug-resistant 
ST131 clone is in direct agreement with recent findings in 
other pathogens. The most striking analogy is with hospital- 
associated strains of Ent. faecium displaying increased 
virulence and antimicrobial drug resistance, which have 
arisen through limited recombination events, followed by a 
marked decrease in detectable core genome recombination 
leading to clonal expansion of a successful variant (Willems 
et al. 2012). The successful spread of such a clonal lineage
could be due to a selective advantage for strains that avoid
the loss of advantageous phenotypes via recombination.
Another observation from our data is how the recombination 
that is detected is primarily limited to within BAPS clusters in 
E. coli, which is seen most clearly when a phylogeny is derived 
from the recombining regions alone, mirroring the core 
genome phylogeny, and the levels of intracluster recombina- 
tion are determined. This would support a similar pattern of 
recombination in evolution of E. coli as that described in the 
hospital-associated Ent. faecium, with recombination being 
restricted as a pathogenic clone becomes successful and 
more niche adapted. As such, in E. coli we observe high
levels of recombination in the intestinal commensal K12,
reducing through the intestinal pathogens, and then restriction
and sexual isolation as the dominant ST131 clone emerges as
an extraintestinal pathogen with high levels of multidrug
resistance. Of course, such restriction is limited to the core
genome, as it is well known that accessory genome elements
such as plasmids and phage easily transfer across our
determined BAPS clusters, as is seen with the CTX-M-15 plasmid and
the Shiga-toxin-like phage crossing into the E. coli O104 strain
(Rasko et al. 2011). However, transfer of pathogenicity islands
may actually also be restricted if one considers that the islands
associated with uropathogenicity are only found in
phylogroup B2 E. coli (Lloyd et al. 2007, 2009). Given that
ST131 is a classical phylogroup B2 strain containing classical
ExPEC-specific virulence factors and that the main difference
between ST131 and its near BAPS cluster 4 neighbors is
extended dissemination of the CTX-M-15 ESBL, it seems most
likely that ST131 has specialized via horizontal gene transfer to
become a multidrug-resistant ExPEC and then recombined
less due to ecological or genetic factors. The emergence of
this strain as a highly successful and dominant lineage would
then most probably be as a result of selection through drug
resistance. 
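
The distinction drawn above between recombination within and between BAPS clusters (shown as pie charts in fig. 6) amounts to a simple proportion. The sketch below assumes each detected recombination event has already been reduced to a hypothetical (recipient cluster, donor cluster) pair; this representation and the function name are illustrative assumptions and do not reflect the output format of BRATNextGen.

```python
from collections import Counter


def recombination_origin_fractions(events):
    """Return the fractions of intra- and inter-cluster recombination events.

    `events` is an iterable of (recipient_cluster, donor_cluster) pairs,
    a hypothetical representation chosen for this illustration.
    """
    counts = Counter(
        "intra" if recipient == donor else "inter"
        for recipient, donor in events
    )
    total = sum(counts.values())
    if total == 0:
        # No detected events: report zero for both categories.
        return {"intra": 0.0, "inter": 0.0}
    return {kind: counts[kind] / total for kind in ("intra", "inter")}


# Hypothetical events for a BAPS cluster 4 strain (not real data).
events = [(4, 4), (4, 4), (4, 7), (4, 4)]
fractions = recombination_origin_fractions(events)
```

In this toy input three of the four events have a donor in the same cluster as the recipient, so the intra-cluster fraction dominates.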

When one considers the reasons for sexual restriction of
admixture in E. coli ST131, there are two main plausible
explanations. The first would be mechanistic barriers to
cross-BAPS-cluster admixture via physical prevention or a
gradual increase in genetic incompatibility, as there is a selective
advantage to reducing recombination and limiting loss of
advantageous genes. The second would be ecological isolation by
reduced opportunity to meet and recombine with strains of
other BAPS clusters. Fully addressing this question requires
further focused research, particularly in furthering our
inadequate understanding of the ecology of human pathogenic
E. coli and especially ExPEC. It has been reported that ST131
is found in companion animals (Ewers et al. 2010; Platell,
Johnson, et al. 2011; Platell et al. 2011) and in poultry food
products (Vincent et al. 2010); however, the actual route of
dissemination and transmission is still not known for certain,
with some studies suggesting that UTI caused by E. coli may
sometimes be transmitted as a sexually transmitted infection
(Foxman 2010) as opposed to the classical fecal-urethral
route considered clinical dogma. The role of ecological
separation in producing the detected recombination
pattern in E. coli may be favored when one considers the cluster
with the next lowest level of detectable core recombination after
ST131, that of BAPS cluster 9. This cluster contains E. coli
O157 strains and their ancestral O55 relatives. Escherichia
coli O157 is a pathogenic variant that has become globally
successful. Moreover, E. coli O157 is ecologically distinct in
that it causes acute infections in humans leading to a transient
and brief colonization of the intestinal tract and is found to
only colonize the recto-anal junction of livestock as opposed
to the more microbially rich intestinal lumen (Naylor et al.
2003). This would suggest that the lineage has less
opportunity to recombine with distant E. coli lineages due to reduced
opportunity to interact in the mammalian intestinal tract.
Similarly, the recent observations on the phylogeography of
Shigella also fit with our postulated model of HGT-driven
divergence and then ecological separation-driven reduction in
recombination (Holt et al. 2012). Shigella are a subset of
E. coli that have become niche restricted to the human
intestinal tract, where they are highly pathogenic, and display
extreme monomorphism across a large population set, with
little to no recombination. Only by better understanding the
ecology of ST131 would we be able to draw fuller inferences
and comparisons.

In conclusion, we present novel data on the population
structure of E. coli. Recombination analysis on this scale has
not been previously undertaken for E. coli due to its
computational complexity, and advanced model-based tools are
the key to unraveling the mysteries of complex evolutionary
processes. Our data suggest that the emergence of new
pathogenic and drug-resistant lineages of E. coli is marked
by reduction in detectable recombination within the core
genome, which is typified by analysis of the E. coli ST131
clone. Our data raise new questions on the evolution
and emergence of new pathogenic E. coli variants and
open up new avenues of research to further our
understanding of the ecology and interactions of extraintestinal
pathogenic E. coli.
Supplementary Material

Supplementary figure S1, tables S1-S6, and file S7 are available
at Genome Biology and Evolution online
(http://www.gbe.oxfordjournals.org).